Breast cancer is one of the most common diseases diagnosed in women
worldwide. The balanced iterative reducing and clustering using hierarchies
(BIRCH) algorithm has been widely used in many applications. However, clustering
patient records and selecting an optimal threshold for the hierarchical
clusters remain challenging tasks. In addition, the existing BIRCH is sensitive
to the order of data records and influenced by many numerical and
functional parameters. Therefore, this paper proposes a BIRCH-based
algorithm for breast cancer clustering. We aim at transforming the
medical records, using the breast screening features, into sub-clusters that group
the subject cases into malignant or benign clusters. The basic BIRCH
clustering is first fed a set of normalized features; we then automate the
threshold initialization to enhance the tree-based sub-clustering procedure.
Additionally, we present a thorough analysis of the performance impact of
tuning BIRCH with various relevant linkage functions and similarity
measures. Two datasets of the standard Breast Cancer Wisconsin (BCW)
benchmarking collection are used to evaluate our algorithm. The
experimental results show a clustering accuracy of 97.7% in only 0.0004
seconds, thereby confirming the efficiency of the proposed method in clustering
the patient records and making timely decisions.


This is an open access article under the CC BY-SA license. 


Corresponding Author: 
Ahmad Alzu'bi 


Department of Computer Science, Jordan University of Science and Technology 


Irbid 22110, Jordan 
Email: agalzubi@just.edu.jo 


1. INTRODUCTION 


Extracting meaningful information from medical records to make proper early decisions is a
demanding task and should be investigated meticulously. Many challenges are usually encountered in the
procedure of disease diagnosis and treatment due to the large amount of medical data generated by health
monitoring systems and equipment. Among the most challenging factors are the diversity of disease
characteristics, heterogeneity of treatment, complexity of data collection and processing, and interpretation of
medical diagnostics generated from various media [1]-[3], e.g., audio, visual, image, and text content.

Clustering is a simple yet efficient unsupervised approach that assigns the data subjects into
highly similar groups, i.e., clusters. However, handling the underlying diversity of clustering analysis,
objectives, terms, and assumptions of various clustering algorithms can be daunting [4], [5]. Therefore, there
is a demand to carefully match the aggregation algorithms to the biomedical
applications. Additionally, an adequate approach to data selection and clustering is crucial in medical
diagnosis, which usually requires relevant knowledge and prior domain expertise.

Clustering feature tree (CF-tree) is an efficient and scalable data clustering method based
on an in-memory data structure that serves as a summary of the data distribution. The CF-tree is the core mechanism
of the hierarchical balanced iterative reducing and clustering using hierarchies (BIRCH) [6]. BIRCH can
handle multi-dimensional data points dynamically or incrementally, and it ordinarily produces good
clustering results in few data scans. Among the common hierarchical clustering approaches, BIRCH is
effective in many real-life applications such as constructing iterative and interactive classifiers and
forming codebooks for image retrieval and segmentation [7]-[9]. A clustering feature (CF) is represented as a
node in the BIRCH clustering tree and summarizes the underlying cluster of a specific point or multiple
points. BIRCH considers the closest points as one group, and the CFs capture this level of
abstraction. Generally, the BIRCH method includes scanning the subjects to construct an in-memory features
tree, rebuilding smaller CF trees, performing a global clustering, and refining the clusters.

However, a downside of the BIRCH algorithm is its sensitivity to the order of data records in the
numerical attributes. Its performance also depends on several parameters including the branching factor Br,
threshold T, and cluster count k. In BIRCH, a height-balanced CF tree of hierarchical clusters is built. A
cluster is represented as a node where the leaves are the actual clusters. The branching factor Br limits the
number of children of a node. A new data point is added to the closest leaf cluster if the cluster radius does not exceed a
defined threshold T. Otherwise, the new data point is assigned to a new empty cluster.
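
The threshold test and the order sensitivity described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the BIRCH tree itself: the records are one-dimensional, the clusters are kept in a flat list rather than a height-balanced tree, the branching factor is not modeled, and the function names `radius` and `insert_points` are hypothetical.

```python
import math


def radius(n, ls, ss):
    """Radius of a 1-D cluster from its summary: sqrt(SS/N - (LS/N)^2)."""
    return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))


def insert_points(points, threshold):
    """Absorb each point into the closest cluster, or open a new one.

    Each cluster is kept as a summary [n, ls, ss]. This is a flat list,
    so no tree and no branching factor are modeled (an assumption).
    """
    clusters = []
    for x in points:
        if clusters:
            # Closest cluster by distance between x and the centroid LS/N.
            closest = min(clusters, key=lambda c: abs(x - c[1] / c[0]))
            n, ls, ss = closest[0] + 1, closest[1] + x, closest[2] + x * x
            if radius(n, ls, ss) <= threshold:
                closest[:] = [n, ls, ss]  # radius stays within T: absorb
                continue
        clusters.append([1, x, x * x])  # otherwise start a new cluster
    return clusters


# The same three records, two insertion orders, the same threshold T.
in_order = insert_points([0.0, 1.0, 2.0], threshold=0.85)
shuffled = insert_points([0.0, 2.0, 1.0], threshold=0.85)
print(len(in_order), len(shuffled))
```

With the same threshold, the first order ends with one cluster and the second with two, which is the kind of order dependence that motivates a more careful threshold choice.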

A proper threshold selection is necessary to improve the accuracy of BIRCH, and it also affects the
size of the clusters. Moreover, the BIRCH performance is largely influenced by the linkage methods that are used to
construct the sub-cluster tree, and by the distance measures used to calculate the distance between the data
points and the cluster centroids. Zhang et al. [6] have shown the superiority of BIRCH compared to the
clustering large applications based on RANdomized search (CLARANS) [10] method. Ismael et al. [11] have
also attempted to address the shortcomings of BIRCH using a single threshold initialization. The CF-tree is
built with the restriction that the leaf entries must use a uniform threshold T while different thresholds are
used to reconstruct the CF tree. Several studies [12]-[18] have also highlighted the impact of using multiple
thresholds or a single threshold either in BIRCH or in other hierarchical clustering methods. Many research efforts have
been devoted to clustering breast cancer records. Vijayarani and Jothi [19] have evaluated the clustering
performance and the outlier detection accuracy. They implemented the aggregation process in data flows and
examined the extreme values in data flows using BIRCH with CLARANS and BIRCH with k-means.
Chowdhary et al. [20] have investigated a hybrid fuzzy method to diagnose breast cancer using the
C-means clustering and support vector machines (SVM) algorithm. Lavanya and Palaniswami [21] have
proposed assigning the data subjects to different classes using the majority weighted minority
oversampling technique (MWMOTE).

In this paper, an improved BIRCH variant is proposed following a three-fold paradigm: attribute
preprocessing, threshold initialization, and evaluation of several linkage and similarity measures. We aim at
building an efficient hierarchical clustering to diagnose breast cancer patients, which respects the
time and storage constraints. We also investigate the impact of outlier patterns on the performance of BIRCH
in terms of clustering accuracy and runtime complexity. The standard benchmarking datasets, Breast Cancer
Wisconsin [22] and Breast Cancer Wisconsin (diagnostic) [23], are used to evaluate the proposed approach.
The remaining part of this paper is organized as follows: section 2 describes the methodology of this work and the
proposed algorithms; section 3 presents the experimental results with detailed discussion and comparisons;
and section 4 concludes this paper.


2. RESEARCH METHOD 
This section presents the conventional basic BIRCH algorithm, the proposed BIRCH-based 
clustering framework, datasets and performance evaluation protocol. 


2.1. The hierarchical BIRCH

The basic BIRCH algorithm consists of four main phases [6], [24]: i) loading data points into a CF
tree to conduct an initial scan of the dataset; ii) optionally, building a smaller CF tree by condensing any
resizable data or merging the crowded sub-clusters; iii) applying a global clustering on the CF data points
through another clustering method, e.g., k-means; and iv) refining the clusters by correcting any inaccuracies
in the CF tree. BIRCH requires initializing the number of branches on the CF-leaf and CF-non leaf nodes. The
location of a data point, i.e., patient record, is compared to the location of each clustering feature at the root
node, and the point is passed down to the closest one. The following are the essential parameters that largely influence
the performance of BIRCH:
- CF features: the number of data points (N) in the cluster, the linear sum of the data points (LS),
and the square sum of the data points (SS). The latter two parameters are defined for the data points x_i as (1) and (2):

$$LS = \sum_{i=1}^{N} x_i \qquad (1)$$

$$SS = \sum_{i=1}^{N} x_i^2 \qquad (2)$$

- Centroid (C): It is derived from a CF and defined as (3):

$$C = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{LS}{N} \qquad (3)$$

- Radius (R): the average distance from any cluster data point to its centroid, and it is defined as (4):

$$R = \sqrt{\frac{\sum_{i=1}^{N} (x_i - C)^2}{N}} = \sqrt{\frac{N \cdot C^2 + SS - 2 \cdot C \cdot LS}{N}} \qquad (4)$$

- Diameter (D): the square root of the average squared distance between all pairs of the cluster
data points, and it is defined as (5):

$$D = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)^2}{N(N-1)}} = \sqrt{\frac{2N \cdot SS - 2 \cdot LS^2}{N(N-1)}} \qquad (5)$$


If two clusters, C1 and C2, are merged, then the constructed CF is the sum of the corresponding
parameters of the two clusters, which is defined as (6):


$$CF = CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2) \qquad (6)$$
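
As an illustration of why the CF triple is convenient, the following minimal sketch computes the centroid, radius, and diameter directly from (N, LS, SS) and merges two CFs by addition as in (6). The class name `CF` and its methods are hypothetical, and the sketch assumes that LS is kept per dimension while SS is stored as a single scalar (the sum of squared norms).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CF:
    n: int       # N: number of data points
    ls: tuple    # LS: per-dimension linear sum
    ss: float    # SS: sum of squared norms (scalar form, an assumption)

    @classmethod
    def from_points(cls, points):
        dims = len(points[0])
        ls = tuple(sum(p[d] for p in points) for d in range(dims))
        ss = sum(v * v for p in points for v in p)
        return cls(len(points), ls, float(ss))

    def __add__(self, other):
        # Merging two clusters only adds the corresponding parameters.
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(v / self.n for v in self.ls)

    def radius(self):
        # sqrt(SS/N - ||C||^2), algebraically equal to the sum form.
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self):
        if self.n < 2:
            return 0.0
        ls2 = sum(v * v for v in self.ls)
        num = 2 * self.n * self.ss - 2 * ls2
        return math.sqrt(max(num / (self.n * (self.n - 1)), 0.0))


a = CF.from_points([(0.0, 0.0)])
b = CF.from_points([(2.0, 0.0)])
merged = a + b
print(merged.centroid(), merged.radius(), merged.diameter())
```

Because the merged summary equals the summary of the pooled points, no raw records have to be revisited when sub-clusters are combined.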


2.2. The framework of improved BIRCH 

Figure 1 shows the sequence of phases involved in the proposed BIRCH for breast cancer
clustering, and each phase is described in turn throughout this paper. Firstly, we use the
benchmarking medical datasets to preprocess the patient records and features by selecting the most relevant
features and fitting them to the corresponding cluster labels (benign and malignant). Secondly, the threshold
value is automatically initialized using a three-step function that selects a random subset of features. Any
data outliers are also eliminated by rescaling the patient features. Data features are rescaled, i.e., normalized,
into a new data space using the minimum/maximum values of all patients' records. Thirdly, we conduct an
ablation study on numerous linkage methods and similarity distance metrics. Finally, all the patients' records
are predicted and assigned to a proper cluster.

Figure 1. A graphical depiction of the main phases involved in the improved BIRCH algorithm


2.3. Data preprocessing 
Data records are preprocessed by selecting the most relevant features and fitting them to the
corresponding cluster labels, i.e., benign and malignant. Additionally, any outliers are detected and
eliminated using feature rescaling. Our data preprocessing procedure consists of two main phases: feature
selection and feature rescaling. A proper feature selection facilitates the construction of clusters and reduces
the data space, hence requiring less processing and storage. In our framework, the patient record is
formulated as a vector of features x=[x_1, ..., x_i]. The redundant records are omitted using a min-max
normalization. Then, the data are split into two groups where x=[x_1, ..., x_i] represents the patient features and
y=[y_1, ..., y_i] represents the cluster, i.e., benign or malignant.

We use random sampling to collect data from the patient dataset, in which all the records have an
equal opportunity of being chosen. The size of the selected data is empirically set to 50% of all records.
Then, we pass this randomly selected sample to the automatic thresholding function. Finally, the matrix
elements (F) are rescaled to generate the normalized features (F_n) as (7):


$$F_n = \mathrm{scale}[F, input_{min}, V(Min), input_{max}, V(Max)] \qquad (7)$$


where F represents the input features, V(Min) is the vector of minimum feature values, V(Max) is the vector
of maximum feature values, input_max is the upper bounding limit of the normalization interval, and input_min is the
lower bounding limit of the normalization interval. This procedure projects the features into a new space within
V(Min) and V(Max). Therefore, it rescales according to the size of input features that corresponds to the
bounding limits, i.e., input_min and input_max.
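
A minimal sketch of one plausible reading of (7) is given below. It assumes that `scale` is the usual column-wise min-max mapping of each feature from [V(Min), V(Max)] onto [input_min, input_max]; the function name `min_max_scale` and the handling of constant columns are assumptions made for illustration.

```python
def min_max_scale(features, input_min=0.0, input_max=1.0):
    """Rescale each feature column from [V(Min), V(Max)] to the target interval."""
    columns = list(zip(*features))
    v_min = [min(col) for col in columns]  # V(Min): per-feature minimum
    v_max = [max(col) for col in columns]  # V(Max): per-feature maximum
    scaled = []
    for row in features:
        new_row = []
        for value, lo, hi in zip(row, v_min, v_max):
            if hi == lo:
                # Constant feature: map to the lower limit (an assumption).
                new_row.append(input_min)
            else:
                ratio = (value - lo) / (hi - lo)
                new_row.append(input_min + ratio * (input_max - input_min))
        scaled.append(new_row)
    return scaled


records = [[1, 10], [2, 20], [3, 30]]
print(min_max_scale(records))
```

After this step every feature lies in the same interval, so a single threshold T can be compared against all features on a common scale.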


2.4. Automatic threshold initialization 

BIRCH clustering builds the CF-tree in which the leaf entries must meet a fixed threshold, but this
usually produces poor clustering quality. In our work, the threshold value is initialized automatically to
improve the clustering accuracy and speed. Therefore, the threshold T is used in the CF-Leaf to store any
changes to the used threshold. Our thresholding algorithm is inspired by the work of Ridler and
Calvard [25], who assign a threshold to separate the image pixels into classes. Correspondingly, we
construct a matrix of patient features and generate a random optimal threshold. In BIRCH, each data point is
assigned to the closest CF-leaf if the radius does not exceed the threshold T. Otherwise, this point is assigned
to a new empty leaf. In contrast, we propose that the new data point that exceeds the threshold should be
initialized automatically, thereby enlarging the radius scale on the leaf nodes and reducing the parent split.
This process includes three steps.
Step 1: Segment the feature matrix into two parts using an initial random threshold, i.e., T(1), as shown
in Algorithm 1.


Algorithm 1. Threshold initialization and features split
Input: sample points from dataset (I) selected randomly
Output: initial threshold
begin
    N: random sample of features, I: features
    counts: summation of elements, T: threshold
    cuSum1: cumulative summation, i=1 //counter for T
    1.1 Find the mean of N features
        T(1)=mean(I)
        counts=features matrix (I)
        calculate the cuSum1 of counts
    1.2 Round the result
        T(i)=sum(N.*counts)/cuSum1(end)
end


Step 2: Calculate a new threshold by averaging the means of the two samples, as shown in Algorithm 2.


Algorithm 2. Calculating the mean values
Input: the mean of features
Output: a new updated threshold
begin
    MBT: mean below the current threshold
    MAT: mean above the current threshold
    counts: summation of elements
    N: random sample of features, T: threshold
    2.1 calculate MBT
        MBT=sum(N(N<=T(i)).*counts(N<=T(i)))/cuSum2(end)
    2.2 calculate MAT
        MAT=sum(N(N>T(i)).*counts(N>T(i)))/cuSum3(end)
    2.3 T(i)=(MAT+MBT)/2 //new threshold
end


Step 3: Repeat step 2 until the threshold value no longer changes, as shown in Algorithm 3.


Algorithm 3. Threshold selection
Input: threshold, Output: optimal threshold
begin
    3.1 repeat Algorithm 2
        while T(i)~=T(i-1)
        T(i)=features matrix
        while ABS(newT(i)-oldT(i-1))>=1 do:
    3.2     cuSum2=cumsum(counts(N<=T(i)))
            MBT=sum(N(N<=T(i)).*counts(N<=T(i)))/cuSum2(end)
            cuSum3=cumsum(counts(N>T(i)))
            MAT=sum(N(N>T(i)).*counts(N>T(i)))/cuSum3(end)
            i=i+1
    3.3     if T(i)~=T(i-1), repeat step 3.2
            T(i)=(MAT+MBT)/2
            T(i)=features matrix
        end while
end
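
The three steps can be sketched compactly in runnable form. This is an interpretation under stated assumptions rather than the exact procedure above: it works directly on a flat list of normalized feature values instead of a counts vector, it replaces the rounding and the ">= 1" stopping rule with a small tolerance `tol`, and the names `sample_records` and `iterative_threshold` are hypothetical.

```python
import random


def sample_records(records, fraction=0.5, seed=0):
    """Draw a random subset in which every record is equally likely."""
    rng = random.Random(seed)
    size = max(1, int(len(records) * fraction))
    return rng.sample(records, size)


def iterative_threshold(values, tol=1e-6, max_iter=100):
    """Ridler-Calvard style iteration on a flat list of feature values."""
    t = sum(values) / len(values)            # Step 1: start from the mean
    for _ in range(max_iter):
        below = [v for v in values if v <= t]
        above = [v for v in values if v > t]
        if not below or not above:
            break                            # nothing left to split
        mbt = sum(below) / len(below)        # mean below the threshold
        mat = sum(above) / len(above)        # mean above the threshold
        new_t = (mat + mbt) / 2              # Step 2: average of the two means
        if abs(new_t - t) < tol:             # Step 3: stop when T is stable
            return new_t
        t = new_t
    return t


values = [0.0, 0.1, 0.2, 1.0]
print(iterative_threshold(values))
```

For the four values above the iteration starts at the mean, moves once, and then stabilizes midway between the two group means.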


2.5. Linkage methods and similarity distances 

BIRCH calculates the distance between data points to join them into clusters iteratively. In binary
clustering, each cluster is shaped by many observations and join methods on the data points and clusters.
Therefore, we consider various linkage methods in our experiments as defined in Table 1. The cluster r is a
join of clusters p and q, n_r is the number of subjects in r, and x_ri is the ith subject in r. Table 2 also
summarizes all the standard similarity distance metrics studied in this work.


Table 1. The linkage methods examined in the proposed approach


Method     Description
Single     It is known as nearest neighbor, and employs the smallest distance between objects in two clusters.
           $d(r,s) = \min(\mathrm{dist}(x_{ri}, x_{sj})),\ i \in (1, \ldots, n_r),\ j \in (1, \ldots, n_s)$ (8)
Complete   It is known as farthest neighbor, and employs the largest distance between objects in two clusters.
           $d(r,s) = \max(\mathrm{dist}(x_{ri}, x_{sj})),\ i \in (1, \ldots, n_r),\ j \in (1, \ldots, n_s)$ (9)
Ward       It calculates the weighted squared Euclidean distance between the centroids of two clusters.
           $d(r,s) = \sqrt{\frac{2 n_r n_s}{n_r + n_s}} \lVert \bar{x}_r - \bar{x}_s \rVert_2$ (10)
           where $\lVert \cdot \rVert_2$ is the Euclidean distance, $\bar{x}_r$ and $\bar{x}_s$ are the centroids of clusters r and s,
           and $n_r$ and $n_s$ are the number of elements in clusters r and s.
Centroid   It calculates the square of Euclidean distance between the centroids of two clusters.
           $d(r,s) = \lVert \bar{x}_r - \bar{x}_s \rVert_2$ (11)
           where $\bar{x}_r = \frac{1}{n_r} \sum_{i=1}^{n_r} x_{ri}$ (12)
Average    It calculates the average distance between all pairs of objects in two clusters.
           $d(r,s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} \mathrm{dist}(x_{ri}, x_{sj})$ (13)
Median     It employs the Euclidean distance between the weighted centroids of the two clusters, $\tilde{x}_r$ and $\tilde{x}_s$.
           $d(r,s) = \lVert \tilde{x}_r - \tilde{x}_s \rVert_2$ (14)


Table 2. Similarity distance metrics


Metric              Description
Euclidean (p=2)     $d(r,s) = \left( \sum_{i=1}^{n} |x_{ri} - x_{si}|^p \right)^{1/p}$ (15)
Cityblock (p=1)     Same form as (15) with p=1.
Chebychev (p=∞)     Same form as (15) with p=∞.
Squared Euclidean   Squared Euclidean that is usually used for regression analysis.
                    $d(r,s) = \sum_{i=1}^{n} (x_{ri} - x_{si})^2$ (16)
StdEuclidean        Standardized Euclidean that divides each squared discrepancy between attributes by the sample variance.
                    $d(r,s) = \sqrt{\sum_{i=1}^{n} \frac{(x_{ri} - x_{si})^2}{s_i^2}}$ (17)
                    where $s_i$ is the standard deviation of the ith attribute.
Mahalanobis         The distance between the data point and the sample distribution using the covariance matrix S.
                    $d(r,s) = \sqrt{(x_r - x_s)\, S^{-1} (x_r - x_s)^{T}}$ (18)
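
To make the definitions in Tables 1 and 2 concrete, the following sketch evaluates a few of them on two small clusters. It covers only the Minkowski family (cityblock, Euclidean, Chebychev) and the single, complete, average, and centroid linkages; Ward, median, standardized Euclidean, and Mahalanobis are left out to keep the sketch short. The function names are hypothetical.

```python
import math
from itertools import product


def minkowski(a, b, p):
    """Minkowski distance; p=1 cityblock, p=2 Euclidean, p=inf Chebychev."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def centroid(cluster):
    n = len(cluster)
    dims = len(cluster[0])
    return tuple(sum(pt[d] for pt in cluster) / n for d in range(dims))


def linkage_distance(r, s, method, p=2):
    """Distance between clusters r and s under a given linkage method."""
    pairwise = [minkowski(a, b, p) for a, b in product(r, s)]
    if method == "single":      # nearest neighbor
        return min(pairwise)
    if method == "complete":    # farthest neighbor
        return max(pairwise)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    if method == "centroid":    # Euclidean distance between centroids
        return minkowski(centroid(r), centroid(s), 2)
    raise ValueError(f"unsupported linkage method: {method}")


r = [(0.0, 0.0), (1.0, 0.0)]
s = [(3.0, 0.0), (5.0, 0.0)]
for name in ("single", "complete", "average", "centroid"):
    print(name, linkage_distance(r, s, name))
```

The same pair of clusters is 2.0 apart under single linkage and 5.0 apart under complete linkage, which illustrates why the choice of linkage can change which sub-clusters are merged first.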


2.6. Datasets and performance metrics 

The Breast Cancer Wisconsin dataset (BCW) [22] consists of 11 attributes and 699 instances divided into
different partitions. It includes the following features: record ID, clump thickness, uniformity of cell
size, uniformity of cell shape, marginal adhesion, normal nucleoli, bare nuclei, epithelial cell size, bland chromatin, mitoses,
and cluster label, i.e., 2 for benign and 4 for malignant. The patient ID is excluded from our experiments.
The Breast Cancer Wisconsin (diagnostic) dataset [23] consists of 31 attributes and 569 instances divided into
different partitions. It includes the cluster label, i.e., M for malignant and B for benign, and 10 features
calculated for each cell nucleus as follows: perimeter, area, radius (mean of distances), texture, smoothness
(radius variation), compactness (perimeter^2/area - 1.0), concavity, concave points, symmetry, and fractal
dimension.

The proposed BIRCH variant is evaluated by the following performance metrics: true positives
(TP), false positives (FP), false negatives (FN), true negatives (TN), accuracy, precision, and recall. We also
use the F-measure (F-score) to make the precision and recall comparable; unlike the arithmetic mean, it
punishes the extreme values more. Additionally, the Fowlkes-Mallows index (Fm-index) is used to measure the
similarity between the obtained clusters and the reference cluster labels.
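
The metrics listed above can be computed from the four confusion counts as in the sketch below. It assumes the standard definitions, with the F-score taken as the harmonic mean of precision and recall and the Fowlkes-Mallows index taken as their geometric mean; the function name `clustering_scores` and the counts in the usage line are hypothetical and are not results from this paper.

```python
import math


def clustering_scores(tp, tn, fp, fn):
    """Standard scores derived from the four confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        # Harmonic mean: a low precision or recall pulls the score down.
        "f_score": 2 * precision * recall / (precision + recall),
        # Geometric mean of precision and recall.
        "fm_index": math.sqrt(precision * recall),
    }


# Illustrative counts only (hypothetical, not taken from the experiments).
scores = clustering_scores(tp=50, tn=40, fp=5, fn=5)
print(scores)
```

The same function also accepts fractions instead of counts, since every score is a ratio of the four inputs.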


3. RESULTS AND DISCUSSION 

In this section, we demonstrate and discuss the experimental results obtained by the improved 
BIRCH clustering. Thresholds are automatically initialized after processing the features of medical records, 
and we also present the results obtained by the basic and improved BIRCH with relevant comparisons. 


3.1. Clustering results on BCW dataset 

Firstly, we discuss the clustering results obtained by the original BIRCH using a range of fixed
thresholds: 0.2, 0.5, and 0.9. These thresholds are manually assigned within the range [0, 1]. Table 3
summarizes the best results recorded using a range of linkage and distance measures in thorough
experiments. It can be observed that the basic BIRCH achieved the best clustering performance using the
ward linkage and the Euclidean similarity distance. It also performs better with a threshold of 0.2 than with the other
threshold values considered in our experiments, i.e., 0.5 and 0.9.

On the other hand, our BIRCH variant outperforms the basic BIRCH over almost all methods using a
randomly initialized threshold. Table 4 shows the clustering results of the improved BIRCH on the BCW dataset.
The improved BIRCH achieves a clustering accuracy of 97.7% and improves the accuracy of the basic BIRCH
by 4% and the recall by 6%. The accuracy results confirm the superiority of the improved BIRCH clustering
using various linkage and similarity distances. The basic BIRCH only outperforms the improved version
using the centroid linkage with the Seuclidean similarity distance. However, both BIRCH versions reported the
best performance using the ward linkage and the Euclidean distance. In terms of speed, the improved BIRCH is
faster than the basic BIRCH under all the experimental configurations. It takes an average time of 0.0006
seconds to complete the clustering process on the BCW dataset compared to an average time of 0.3723
seconds for the basic BIRCH, and it is also fast on the BCW (diagnostic) dataset.


Table 3. Clustering results on BCW dataset using the basic BIRCH


Linkage   Distance      Time(s)  Recall  TP    TN    FP    FN    Fm    Accuracy  Threshold
Ward      Euclidean     0.13     0.93    0.51  0.43  0.02  0.04  0.94  0.936     0.2
Ward      Seuclidean    0.32     0.93    0.51  0.42  0.03  0.04  0.94  0.931     0.9
Ward      SqrEuclidean  0.10     0.92    0.50  0.38  0.08  0.04  0.90  0.884     0.2
Centroid  Euclidean     0.38     0.99    0.55  0.00  0.45  0.00  0.74  0.548     0.2
Centroid  Seuclidean    0.11     0.43    0.51  0.42  0.03  0.03  0.93  0.928     0.2
Centroid  SqrEuclidean  0.98     0.93    0.50  0.41  0.05  0.04  0.92  0.915     0.2
Average   Euclidean     0.58     0.92    0.52  0.01  0.06  0.01  0.90  0.889     0.2
Average   Seuclidean    0.007    0.94    0.52  0.07  0.38  0.03  0.74  0.585     0.5
Average   SqrEuclidean  0.94     0.92    0.50  0.38  0.07  0.05  0.90  0.886     0.2
Single    Euclidean     0.01     0.99    0.54  0.01  0.44  0.01  0.74  0.548     0.2
Single    Seuclidean    0.001    0.99    0.55  0.01  0.42  0.02  0.74  0.548     0.2
Single    SqrEuclidean  0.91     0.99    0.54  0.01  0.44  0.01  0.74  0.548     0.2


Table 4. Clustering results on BCW dataset using the improved BIRCH


Linkage   Distance      Time(s)  Recall  TP    TN    FP    FN    Fm    Accuracy  Threshold
Ward      Euclidean     0.0004   0.99    0.52  0.44  0.02  0.04  0.96  0.977     0.38
Ward      Seuclidean    0.0002   0.99    0.51  0.43  0.03  0.03  0.95  0.949     0.48
Ward      SqrEuclidean  0.0002   1.00    0.47  0.40  0.05  0.07  0.89  0.937     0.47
Centroid  Euclidean     0.0006   1.00    0.55  0.00  0.45  0.00  0.74  0.656     0.44
Centroid  Seuclidean    0.0009   1.00    0.54  0.01  0.45  0.01  0.74  0.655     0.48
Centroid  SqrEuclidean  0.0010   0.98    0.51  0.43  0.03  0.03  0.94  0.967     0.47
Average   Euclidean     0.0007   0.98    0.51  0.42  0.04  0.03  0.93  0.962     0.44
Average   Seuclidean    0.0009   0.99    0.52  0.43  0.02  0.03  0.95  0.969     0.38
Average   SqrEuclidean  0.0005   0.99    0.51  0.43  0.02  0.04  0.94  0.969     0.45
Single    Euclidean     0.0006   1.00    0.53  0.01  0.45  0.01  0.74  0.656     0.44
Single    Seuclidean    0.0007   1.00    0.55  0.00  0.45  0.00  0.74  0.656     0.44
Single    SqrEuclidean  0.0008   1.00    0.55  0.00  0.45  0.00  0.74  0.656     0.37


3.2. Clustering results on BCW (diagnostic) dataset

Table 5 summarizes the clustering results obtained after applying the best configuration of the basic
and improved BIRCH on the BCW (diagnostic) dataset. Our BIRCH variant outperforms the basic
one with an accuracy of 93.3% compared to 65.5% under the same setups. Also, the average clustering time of
the improved BIRCH is about 0.0008 seconds compared to 0.6424 seconds taken by the basic BIRCH.


Table 5. Clustering results on the BCW (diagnostic) dataset


Method          Time(s)  Recall  TP     TN     FP     FN     Fm     Accuracy  Threshold
Basic BIRCH     0.6420   0.873   0.465  0.189  0.278  0.067  0.739  0.655     0.200
Improved BIRCH  0.0008   0.969   0.478  0.398  0.070  0.054  0.884  0.933     0.561


3.3. Clustering hierarchical relationship 

Figures 2(a) and 2(b) depict the patients' clusters of breast cancer using the improved BIRCH
compared to the basic version obtained by the best configuration, i.e., ward linkage and Euclidean distance.
As shown in Figure 2(a), two clusters (benign and malignant) of breast cancer records are represented by
rescaled features in the improved BIRCH and optimally predicted using a random threshold of 0.38. It can be
observed that the overlapping features at the cluster borderlines are minimized by our BIRCH variant. The
BIRCH clusters are also visualized using the dendrogram [26], which depicts the hierarchical relationship
between the dataset records, i.e., cluster objects. It is a common representation of hierarchical
clustering, as shown in Figures 3(a) and 3(b). All the data points are shown at the bottom of the dendrogram.
Each point or subject is initially assigned to a separate cluster, and any two close clusters are merged until a final
cluster is formed at the top. The height in the dendrogram is the similarity distance between two clusters in the data
space. The highest mean and median Fm scores were obtained for the basic BIRCH and improved BIRCH
using a threshold of 0.2 and a random threshold of 0.38, respectively. It can be observed that the cluster merging in
the improved BIRCH is better than in the basic one at showing which clusters are very similar.

Figure 3. The dendrogram plots of clustering, (a) improved BIRCH and (b) basic BIRCH


3.4. Precision and recall evaluation 

We consider here how precision and recall reflect the clinical sensitivity, i.e., the fraction of true positives among
all subjects with breast cancer, and the clinical specificity, i.e., the fraction of true negatives among all subjects without breast cancer.
Table 6 summarizes the results of the improved BIRCH on the two datasets. The reported results approach
100% precision and 100% recall on both datasets, which confirms the stability of the clustering algorithm. We
also underline the importance of measuring the recall and precision at the same time using the F-score [27],
as shown in Figure 4(a). The improved BIRCH achieves higher F-scores than the basic BIRCH. A
sample of breast tumors diagnosed as benign or malignant is shown in Figure 4(b).


Table 6. Precision-recall results


                   Precision        Recall
Configuration      BCW     BCWD     BCW     BCWD
Ward+Euclidean     0.992   0.977    0.996   0.989
Ward+Seuclidean    0.992   0.944    0.995   0.969
Ward+SqrEuclidean  1.000   0.985    1.000   0.997


3.5. Comparisons with related works 
As shown in Table 7, we compare the performance of our proposed BIRCH algorithm in terms of
accuracy, precision, and recall with the two most related clustering works examined on the same dataset, i.e.,
BCW. It can be observed that our BIRCH algorithm outperforms the other approaches in all the
performance metrics, which emphasizes its high capability in clustering the breast cancer records.

Automatic BIRCH thresholding with features transformation for ... (Ahmad Alzu'bi)
ISSN: 2088-8708

Figure 4. The performance of improved BIRCH in terms of F-score, (a) F-score results on BCW and
BCW (diagnostic) and (b) diagnosed breast tissues [28]


Table 7. Performance comparison with the related approaches


Method              Accuracy  Precision  Recall
BIRCH+k-means [19]  0.704     0.748      0.768
BIRCH+CLARANS [19]  0.764     0.760      0.760
BIRCH+MWMOTE [21]   0.969     0.940      0.970
This paper          0.977     0.995      0.991


4. CONCLUSION 

In this paper, we have improved the capability of the hierarchical BIRCH aggregation algorithm in
clustering the medical records of breast cancer patients. The experimental results emphasize the superiority
of the improved BIRCH over the basic BIRCH with efficient feature selection, data rescaling, automatic
threshold initialization, linkage methods, and distance metrics. We demonstrated that proper data
preprocessing improves the BIRCH performance. Additionally, our proposed automatic thresholding largely
increases the quality of the generated clusters. Also, the impact of linkage methods on the complexity of tree
subgroups, i.e., subclustering, is highlighted. We achieved a clustering accuracy of 97.7% with
more discriminative clusters than the original BIRCH. In future work, the proposed BIRCH could be further
optimized by passing the cluster centroids to another clustering algorithm, e.g., k-means. This procedure
could be adopted in a sequential or parallel manner, i.e., various representations.